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Abstract. Previous authors have shown how to build FM-indexes efficiently in external memory, but 
querying them efficiently remains an open problem. Searching naively for a pattern P requires @(|P|) 
random access. In this paper we show how, by storing a few small auxiliary tables, we can access data 
only in the order in which they appear on disk, which should be faster. 

An FM-index [4J is a compressed representation of a text that allows us to quickly search for 
arbitrary patterns in that text. Their growing popularity in genomics (e.g., in BWT-SW, Bowtie, 
SOAP2 and BWA) means we should look for ways in which they can handle massive datasets, 
which may have to reside in external memory even when compressed. Unfortunately, although we 
know how to build FM-indexes efficiently in external memory [3] , querying them efficiently remains 
an open problem. Searching naively for a pattern P requires @(|P|) random access, which are 
expensive due to seek times. We refer the reader to the papers by Chien et al. [lj, Hon et al. [5] and 
Ferragina [2 J for more discussion of this problem. In this paper we extend a result by Orlandi and 
Venturini [_6j to show how, by storing a few small auxiliary tables, we can access data only in the 
order in which they appear on disk. We may read slightly more data but, since sequential access to 
disk is orders of magnitude faster than random access, our modified index should be faster overall. 

FM-indexes are based on the Burrows- Wheeler Transform (BWT), which permutes the char- 
acters of a string T based on the contexts that follow them. We can compute B = BWT(T) by 
lexicographically sorting the rotations of T, then recording the last character of each rotation. (If 
we want to recover T later, we append a special symbol before computing B or record the position 
to which a designated character is mapped.) For example, if 

T = 110111100101110101010001111 , 

then 

B = 110111011001001011111010110 . 

We use binary strings for simplicity but the results in this paper extend to any reasonable alphabet 
size. Notice that, for any pattern P, the characters immediately preceding occurrences of P in 
T are adjacent in B (considering T to be cyclic). For example, if P = 0101 then the characters 
immediately preceding occurrences of T are T[8],T[14] and T[16], which are mapped to 2? [7], 2? [6] 
and 2?[5], respectively. We call B[5..7] the interval for P = 0101. 

The basic operation of FM-indexes is to find the interval in B for any given pattern P. For 
example, the length of the interval is the number of occurrences of P in T. Notice that the left 
endpoint of the interval is the rank of the lexicographically first rotation of T that starts with 
P, and the right endpoint is the rank of the lexicographically last such rotation. To find these 
endpoints, we store data structures such that, for any character c in the alphabet and any position 
i in B, we can quickly compute the number rank c (i) of occurrences of c in B[l..i]. We also store the 
number C[c] of characters in B lexicographically less than c. 



Suppose we are naively searching for the right endpoint of the interval; finding the left endpoint 
is essentially symmetric. We iteratively compute 



j 1 = r a nk P[lPl] (\B\) + C[P[\P\}\ , 
j 2 = ra nk P[lPl _ 1] (j 1 ) + c\P[\P\-l] 



j 3 = rank P [|p|_ 2] (j 2 ) + C 



P\\P\-2] 



j\ P \ = rankpflto'ipi.x) + C[P[1}] ; 
by induction, j\ P \ is the right endpoint. In our example C = [0, 10], so we compute 

ji = ranki(27) + C[l] = 27, 
j 2 = rank (27) + C[0] = 10, 
j 3 = rank 1 (10) + C[l] = 17, 
j 4 = rank (17) + C[0] = 7. 

Unfortunately, with this method, the sequence of positions for which we answer rank queries can 
be far from ordered and so, with current data structures supporting those queries, we use many 
random access. 

Our idea is to build a series of small auxiliary tables that appear on disk before B, in column- 
major order. Each of these tables stores the answers to rank c queries for each character c, sampled 
at evenly spaced positions. The sample rate increases geometrically from each table to the next. 
For our example we might store two tables in addition to B, 



rank (9) = 2 
rank (18) = 7 



ranki(9) = 7 
ranki(18) = 11 



rank (27) = 10 rank!(27) = 17 



rank (3) = 1 ranki(3) = 2 

ranko(6) = 1 ranki(6) = 5 

rank (9)=2 ranki(9) = 7 

rank (12) = 4 rank^^) = 8 

rank (15) = 6 ranki(15) = 9 

rank (18) = 7 ranki(18) = 11 

rank (21) = 7 ranki(21) = 14 

rank (24) = 9 ranki(24) = 15 

rank (27) = 10 ranki(27) = 17 

Assume our first table is small enough to fit into main memory. We need no rank queries to 
compute ji = 27; nor do we need any to compute j 2 = 10, although that is just because P[4] is 
an occurrence of the largest character in the alphabet. To compute j'3 exactly we need ranki(10), 
which we do not have stored. However, we estimate ranki(10) ranki(18) — (18 — 10) = 3, which 
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leads to the estimate j'3 ~ 13. We then estimate ranko(13) ~ ranko(18) — (18 — 13) = 5, which leads 
to the estimate j'4 ~ 5. 

Orlandi and Venturini [6] pointed out that, if j — t < i < j, then rank c (j) — min(j — i,£) < 
rank c (i) < rank c (j), which means that our estimate j'3 ~ 13 is a lower bound within 8 of the true 
value; in general, our error can be as large as the distance between samples, which is 9 in this case. 
The surprising part of their result is that our error cannot exceed this distance, even after repeated 
estimations using this formula. Therefore, our estimate j'4 ~ 5 is also a lower bound within 9 of the 
true value. 

We now discard the first table and consider what information we want from the second table. 
We know j\ = 27, j'2 = 10, 13 < j'3 < 22 and 5 < j'4 < 14, considering always only the loose error 
bound 9; we want to re-estimate ranki(lO) and ranko(js) in order to re-estimate j'3 and j'4. Therefore, 
we want to read the values ranki(12) = 8 to re-estimate j'3; depending on that re-estimate of j'3, 
we will use one of the values ranko(15) = 6, ranko(18) = 7, ranko(21) = 7 and ranko(24) = 9 to 
re-estimate j'4. We consider where all these values appear in the second table (notice they form two 
consecutive blocks; generally there will be one block for each value we are trying to estimate), sort 
the positions, and then read them sequentially. 

We re-estimate ranki(lO) ~ ranki(12) — (12 — 10) = 6, which leads to the re-estimate j'3 ~ 16. 
We then estimate ranko(16) ~ ranko(18) — (18 — 16) = 5, which leads to the re-estimate j'4 = 5. 
Our re-estimates of j'3 and j'4 are again lower bounds, this time within 3 of the true values. We now 
discard the data we have read from the second table and consider what data we want to read from 
B itself, sort the positions, and then read them sequentially. Details of how we do this depend on 
which data structures we use to support rank queries on B, but now the sequence of positions for 
which we answer rank queries is ordered. 

Calculation shows that, if we increase the sample rate by a factor of r between each table, then 
we use C(log r |T|) tables of total size 0(<t|T| log \T\/r) bits, where a is the size of the alphabet. For 
reasonable values of r and a, this space bound should usually be small compared to B itself, and 
may be reducible with clever encoding of the tables. To find the interval for P, we read a total of 
o(r|P|log 2 |T|/logr) bits from the tables, which is more than what we would read with the naive 
method, but we read them sequentially. 
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